
Conversation

@ChuxiJ (Contributor) commented Feb 10, 2026

Summary

Add a comprehensive GPU tier configuration system and boundary testing framework to determine minimum VRAM requirements for different optimization levels.

GPU Tier System (acestep/gpu_config.py)

  • Add GPUConfig dataclass with per-tier settings for quantization, offload, LM models, batch size, and duration limits
  • Implement automatic GPU tier detection based on available VRAM
  • Support VRAM simulation via MAX_CUDA_VRAM environment variable with hard VRAM cap enforcement using torch memory fraction
  • Define tiers: tier1(4GB), tier2(6GB), tier3(8GB), tier4(12GB), tier5(14GB), tier6a(16GB), tier6b(24GB), unlimited(48GB+)
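
The following is a minimal sketch, under simplifying assumptions, of how the tier table, detection, and VRAM cap fit together: a trimmed-down GPUConfig, detection that honors the MAX_CUDA_VRAM override, and the hard cap via torch's per-process memory fraction. Field names and the exact tier boundaries are abbreviated here; the real acestep/gpu_config.py defines more fields (recommended LM model, backend restriction, compile defaults) and the authoritative thresholds.

```python
# Illustrative sketch only; simplified from the PR description.
import os
from dataclasses import dataclass, field
from typing import List

import torch


@dataclass
class GPUConfig:
    tier: str
    gpu_memory_gb: float
    quantization_default: bool = True      # INT8 quantization on by default
    offload_to_cpu_default: bool = True    # CPU offload on by default
    available_lm_models: List[str] = field(default_factory=list)
    max_batch_size_with_lm: int = 1
    max_duration_with_lm: float = 240.0


def get_gpu_memory_gb() -> float:
    """Total VRAM in GB, honoring the MAX_CUDA_VRAM simulation override."""
    simulated = os.environ.get("MAX_CUDA_VRAM")
    if simulated:
        cap_gb = float(simulated)
        if torch.cuda.is_available():
            total = torch.cuda.get_device_properties(0).total_memory / 1024**3
            # Hard cap: enforce the simulated budget as a memory fraction
            torch.cuda.set_per_process_memory_fraction(min(1.0, cap_gb / total))
        return cap_gb
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def get_gpu_tier(vram_gb: float) -> str:
    """Map detected VRAM to a tier name (boundaries simplified here)."""
    for limit, name in [(4, "tier1"), (6, "tier2"), (8, "tier3"),
                        (12, "tier4"), (14, "tier5"), (16, "tier6a"),
                        (24, "tier6b")]:
        if vram_gb <= limit:
            return name
    return "unlimited"
```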

Boundary Testing (profile_inference.py)

  • Add --tier-boundary flag to tier-test mode for automated boundary analysis across all VRAM tiers
  • Refactor tier test logic into reusable _run_single_tier_test()
  • Test three variants per tier: default, no-quant, no-offload
  • Smart skipping when tier already disables the tested optimization
  • Add _print_boundary_summary() with clear results table
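
Invoked via `python profile_inference.py --mode tier-test --tier-boundary`, the sweep can be pictured roughly as below. This is a sketch driven by the tier's quantization_default and offload_to_cpu_default flags; the wrapper name run_boundary_variants and the reduced argument list are illustrative, while the real loop passes several more arguments to _run_single_tier_test.

```python
# Sketch of the per-tier boundary sweep; not the full profile_inference.py code.
def run_boundary_variants(sim_gb, gpu_config, run_single_tier_test):
    results = []

    # Variant 1: the tier's default settings.
    results.append(run_single_tier_test(sim_gb, gpu_config, test_variant="default"))

    # Variant 2: disable INT8 quantization, keep offload as configured.
    if gpu_config.quantization_default:
        results.append(run_single_tier_test(
            sim_gb, gpu_config, quantization_override=None, test_variant="no-quant"))
    # else: the tier already runs without quantization, nothing new to test.

    # Variant 3: no quantization AND no CPU offload.
    if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
        results.append(run_single_tier_test(
            sim_gb, gpu_config,
            offload_override=False, quantization_override=None,
            test_variant="no-offload"))
    # else: the tier already has both disabled, so the default run covers it.

    return results
```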

Boundary Test Results

  • No INT8 Quantization: minimum tier2 (6GB), peak 4.91GB
  • No CPU Offload: minimum tier3 (8GB), peak 7.30GB

Handler & UI Updates

  • Enhanced model offload/load context management in handler.py
  • Updated Gradio UI to expose GPU tier settings
  • Updated API server for tier-aware configuration
  • Improved nano-vllm model runner compatibility

Documentation

  • Updated GPU_COMPATIBILITY docs (en/zh/ja/ko)
  • Updated BENCHMARK docs (en/zh) with tier-boundary CLI reference
  • Updated INFERENCE, INSTALL, GRADIO_GUIDE docs across all languages
  • Updated README with GPU tier information

Summary by CodeRabbit

  • New Features

    • UI now auto-selects recommended LM, backend, offload, quantization and updates audio duration/batch sliders by detected GPU tier (including tier-aware recommendations).
  • Improvements

    • New multi-column GPU compatibility guidance (more granular VRAM bands, Backend/Notes, Tier 6a/6b).
    • Tier-test profiling mode and VRAM profiling utility for automated tier/boundary validation.
    • Raised auto-offload threshold (~20GB) and safer LM downgrade for very-low-VRAM GPUs.
    • Enhanced VRAM guard: adaptive chunk sizing, batch clamping, and CPU fallback for VAE decode.
  • Documentation

    • Expanded localized docs (EN/JA/KO/ZH) with tier-aware defaults and testing guides.
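
For the "Enhanced VRAM guard" item above, here is a minimal sketch of the VAE-decode fallback cascade (tiled GPU decode, shrinking chunks on OOM, then full CPU decode as a last resort). The function name and the vae.decode call shape are assumptions for illustration; the actual handler.py logic also consults free VRAM, the ACESTEP_VAE_ON_CPU override, and VAE_DECODE_MAX_CHUNK_SIZE.

```python
# Rough sketch of the fallback cascade; details differ in the real handler.py.
import torch


def decode_latents_with_fallback(vae, latents, chunk_size):
    """Try tiled GPU decode, shrink chunks on OOM, then fall back to CPU."""
    while chunk_size >= 1:
        try:
            chunks = [vae.decode(latents[:, i:i + chunk_size])
                      for i in range(0, latents.shape[1], chunk_size)]
            return torch.cat(chunks, dim=1)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            chunk_size //= 2  # adaptive chunk sizing: retry with smaller tiles
    # Last resort: decode entirely on CPU (slow but memory-safe)
    return vae.to("cpu").decode(latents.to("cpu"))
```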


coderabbitai bot commented Feb 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

📝 Walkthrough

Adds extensive VRAM-aware GPU tiering, adaptive GPUConfig defaults (LM, backend, offload, quantization), LM selection/downgrade rules, VAE/decoding VRAM fallbacks, VRAM simulation and profiling tools (tier-test, profile_vram), Gradio UI tier-aware defaults/limits, and broad docs updates. No model algorithm changes.

Changes

  • GPU Configuration Core (acestep/gpu_config.py): Major expansion: new VRAM constants (VRAM_AUTO_OFFLOAD_THRESHOLD_GB), tier split (tier6a/tier6b) and alias, extended GPUConfig fields (recommended_lm_model, lm_backend_restriction, recommended_backend, offload_/quantization_/compile_ defaults), VRAM accounting helpers, adaptive-config calculation, LM-size selection helpers, and debug/MAX_CUDA_VRAM support.
  • Pipeline & API Integration (acestep/acestep_v15_pipeline.py, acestep/api_server.py): Replace hard-coded 16GB checks with VRAM_AUTO_OFFLOAD_THRESHOLD_GB; add a 4B→1.7B downgrade when GPU memory is below the threshold; surface the new constant via imports and update messages.
  • UI Integration (acestep/gradio_ui/interfaces/generation.py, acestep/gradio_ui/events/generation_handlers.py, acestep/gradio_ui/events/__init__.py): Tier-aware UI defaults and constraints: available LM models/backends filtered by tier, recommended model resolution from disk, backend restrictions, dynamic duration/batch updates after init, and enriched user info/status messages.
  • Memory Management & Safety (acestep/handler.py): VAE_DECODE_MAX_CHUNK_SIZE, VRAM-aware auto chunk sizing, _vram_guard_reduce_batch to auto-reduce batch size, a _decode_on_cpu CPU fallback, tiled-decode OOM cascades, and expanded VRAM logging/warnings.
  • vLLM / KV-budget safety (acestep/third_parts/nano-vllm/.../model_runner.py): Add a MAX_CUDA_VRAM simulation hook, enforce a minimum of post-KV-cache free VRAM, and emit diagnostics/warnings when post-allocation free VRAM < 1GB.
  • Profiling & Tier Testing (profile_inference.py): New tier-test mode and CLI flags (--tiers, --tier-with-lm, --tier-skip-compile, --tier-boundary, --tier-batch-boundary): automates per-tier simulation, model selection, short inference tests, boundary/batch analysis, and summaries.
  • VRAM Profiling Utility (scripts/profile_vram.py): New script for component-level VRAM profiling (DiT, VAE, text encoder, LM) with memory stats, OOM-safe measurement loops, and optional JSON output to aid VRAM calibration.
  • Docs & Guides (EN/JA/KO/ZH) + README (README.md, docs/**): Replace the simple GPU table with richer backend-aware tier tables and Adaptive UI Defaults; document the VRAM guard, adaptive VAE decode, auto chunk sizing, MAX_CUDA_VRAM debug usage, and tier-test/boundary testing; update install/guide text and examples.
  • UI wiring (acestep/gradio_ui/events/__init__.py): Extend init_btn outputs to include audio_duration and batch_size_input updates after initialization.

Sequence Diagram(s)

sequenceDiagram
    participant UI as Gradio UI
    participant API as API Server / Pipeline
    participant GPU as gpu_config
    participant Disk as Disk models
    participant LM as LM Loader
    participant VAE as VAE/DiT runtime

    UI->>API: init request (init_params)
    API->>GPU: probe get_gpu_memory_gb(), get_gpu_tier()
    GPU-->>API: GPUConfig (recommended_lm, backend, offload, quantization, limits)
    API->>Disk: find_best_lm_model_on_disk(recommended_lm)
    Disk-->>API: chosen_model or none
    API->>LM: attempt LM init (selected model, backend)
    alt LM too large or backend restricted
        LM-->>API: fail / downgrade -> API disables LM or selects smaller model
    end
    API->>UI: return tier-derived UI updates (duration, batch, warnings)
    UI->>API: start generation
    API->>GPU: estimate_inference_vram()
    alt estimated > available
        API->>API: _vram_guard_reduce_batch -> adjust batch/duration
        API->>VAE: use adaptive chunk size or perform _decode_on_cpu on OOM
    end
    API-->>UI: generation results or OOM diagnostics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • ChuxiJ

Poem

🐰
I hopped through tiers of VRAM bright,
I nudged the UI, tuned models right.
When memory's thin I gently shrink,
I stitch fallbacks so things won't sink.
Hooray — safe runs, with carrots in sight. 🥕

🚥 Pre-merge checks: ✅ 3 passed

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: introducing a GPU compatibility tier system with boundary testing capabilities across the codebase.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 80.88%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/gpu-compatibility-tier-boundary-testing

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py (1)

247-260: ⚠️ Potential issue | 🟠 Major

Reserve can be bypassed when available_for_kv_cache <= 0.
With MAX_CUDA_VRAM simulation or high current usage, the fallback (free * 0.5) plus max(1, …) still allocates KV cache, defeating the 1 GB reserve and risking immediate OOM. Consider clamping to zero and letting the existing guard raise when there’s no headroom.

🛠️ Proposed fix to respect the reserve and fail fast
-        if available_for_kv_cache <= 0:
-            available_for_kv_cache = free * 0.5  # Fallback to 50% of free memory
-
-        config.num_kvcache_blocks = max(1, int(available_for_kv_cache) // block_bytes)
-        if config.num_kvcache_blocks <= 0:
+        if available_for_kv_cache <= 0:
+            available_for_kv_cache = 0
+
+        config.num_kvcache_blocks = int(available_for_kv_cache) // block_bytes
+        if config.num_kvcache_blocks <= 0:
             raise RuntimeError(
                 f"Insufficient GPU memory for KV cache. "
                 f"Free: {free / 1024**3:.2f} GB, Current: {current / 1024**3:.2f} GB, "
                 f"Available for KV: {available_for_kv_cache / 1024**3:.2f} GB, "
                 f"Block size: {block_bytes / 1024**2:.2f} MB"
             )
🤖 Fix all issues with AI agents
In `@acestep/handler.py`:
- Around line 1608-1617: The VRAM guard in _vram_guard_reduce_batch is checking
self.config_path which initialize_service never sets, so base-model detection
never triggers; update the check to use the existing config object (e.g.,
self.config) instead—inspect self.config.is_turbo or other fields on self.config
to determine base vs turbo and multiply per_sample_gb by 2.0 when appropriate;
ensure this logic is applied where per_sample_gb is computed in
_vram_guard_reduce_batch and remove or stop relying on self.config_path, or set
self.config_path during initialize_service if you prefer that pattern.
- Around line 3709-3718: The VRAM auto-check erroneously runs on non-CUDA
backends (MPS/XPU) because get_effective_free_vram_gb() returns 0 when
torch.cuda.is_available() is false, forcing VAE decode to CPU; change the logic
in the generate_music VAE decision block to only call
get_effective_free_vram_gb() and apply the _effective_free < 0.5 gate when
torch.cuda.is_available() is true (i.e., wrap the effective-free-VRAM check in a
cuda-available conditional), while preserving the ACESTEP_VAE_ON_CPU env
override and the _vae_cpu variable behavior so only CUDA devices can auto-enable
CPU VAE decode.

In `@acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py`:
- Around line 269-282: The f-string log in model_runner.py uses a Unicode
multiplication sign (×) which triggers RUF001 and can cause copy/paste/terminal
issues; update the print statement that formats KV cache info (the one
referencing config.num_kvcache_blocks, self.block_size, max_tokens_capacity,
kv_cache_size_gb, free, current, target_total_usage, block_bytes, post_kv_free)
to replace the Unicode "×" with a plain ASCII "x" character so the message
becomes e.g. "{config.num_kvcache_blocks} blocks x {self.block_size} tokens =
..." while keeping the rest of the formatting unchanged.

In `@docs/en/ace_step_musicians_guide.md`:
- Around line 157-160: Update the enthusiast tier entry so its batch-size range
follows the tier progression: locate the line containing "16-20 GB (enthusiast)"
and the phrase "1-4 songs at a time" and change it to "2-4 songs at a time"
(keeping the rest of the text, e.g., "Songs up to 10 minutes" and "Larger
Songwriter brain (1.7B)" unchanged) so the lower bound is consistent with the
8-12GB and 12-16GB tiers.

In `@docs/en/BENCHMARK.md`:
- Around line 160-223: The sample output code fence under the "tier-test"
section (the "TIER TEST RESULTS" block) lacks a language tag; update the opening
fence from ``` to ```text to satisfy MD040 and Markdown linting, leaving the
fence contents and closing ``` unchanged so the block is explicitly marked as
plain text.

In `@docs/zh/BENCHMARK.md`:
- Around line 187-199: The fenced code block that begins with "TIER TEST
RESULTS" is missing a language specifier; update the opening fence from ``` to
```text (or another appropriate spec like ```console) so syntax highlighters and
accessibility tools recognize it—modify the code block in the
docs/zh/BENCHMARK.md content around the "TIER TEST RESULTS" section to include
the language tag on the opening backticks.

In `@profile_inference.py`:
- Line 1072: The print call using an unnecessary f-string should be changed to a
regular string: locate the statement print(f"\n  --- Variant: default ---") in
profile_inference.py and remove the leading f so it becomes print("\n  ---
Variant: default ---"); no other behavior changes are needed.
- Around line 1092-1109: Fix the unnecessary f-string prefixes on print
statements that have no interpolations: replace print(f"...") with print("...")
for the messages around the "no-quant" and "no-offload" variants in the block
that calls _run_single_tier_test; specifically update the print calls that
reference the no-quant and no-offload messages which use
gpu_config.quantization_default and gpu_config.offload_to_cpu_default to
determine skips so they are ordinary string literals instead of f-strings.
- Around line 850-853: Remove the now-unnecessary noqa by deleting the "# noqa:
F401" comment on the import flash_attn line in the try block (the import and the
subsequent use of use_flash_attention already satisfy linter rules), i.e.,
update the import flash_attn statement so it no longer includes the noqa
directive.
- Around line 1276-1286: The two print calls that use f-strings without
placeholders should be regular strings: in the block that references
capability_name, failing, and passing (the one that prints the boundary summary
and returns passing[0] if any), replace print(f"    ❌ No tier passed this test.
All tested tiers failed.") and print(f"    ⚠️ No test results available for this
capability.") with print("    ❌ No tier passed this test. All tested tiers
failed.") and print("    ⚠️ No test results available for this capability.")
respectively so the unnecessary f-string prefixes are removed.

In `@scripts/profile_vram.py`:
- Around line 318-321: The code builds encoder_path =
os.path.join(checkpoint_dir, "text_encoder") which doesn't match the runtime
checkpoint name (e.g., "Qwen3-Embedding-0.6B"), so profiling can skip the
encoder; update the logic in scripts/profile_vram.py around encoder_path to try
the runtime checkpoint name as a fallback (check for
os.path.exists(os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")) if the
"text_encoder" path is missing) and only return {} after both attempts fail, or
prefer the runtime-named directory when present; ensure references to
encoder_path, checkpoint_dir and the literal names ("text_encoder",
"Qwen3-Embedding-0.6B") are used so the handler and this script align.
- Around line 165-183: The current DiT profiling only allocates and deletes
dummy tensors (noise, text_hidden, text_mask) and never executes the model, so
peak memory misses activation usage; replace the no-op block with a minimal
forward pass by calling the DiT model (e.g., model(noise, text_hidden,
text_mask) or model.forward(...)) inside the torch.inference_mode() context so
activations are allocated and measured, and when has_cfg is true duplicate the
inputs (noise_cfg, text_hidden_cfg, text_mask_cfg) and pass the doubled batch to
the model to simulate classifier-free guidance; alternatively, if you
intentionally only want to measure input allocation, rename peak_inference_gb to
peak_input_allocation_gb to reflect the narrower measurement.
🧹 Nitpick comments (7)
acestep/acestep_v15_pipeline.py (1)

212-223: String replacement for model downgrade is brittle.

The model path downgrade using replace("4B", "1.7B") assumes a specific naming pattern. If a model path contains "4B" elsewhere (e.g., in a directory name or version suffix), this could produce unexpected results.

Consider using a more robust approach that validates the replacement actually targets the model size portion of the path:

🛡️ Suggested improvement
     if args.lm_model_path and 0 < gpu_memory_gb < VRAM_AUTO_OFFLOAD_THRESHOLD_GB:
         if "4B" in args.lm_model_path:
-            # Downgrade to 1.7B if available
-            fallback = args.lm_model_path.replace("4B", "1.7B")
+            # Downgrade to 1.7B if available - only replace in model name portion
+            import re
+            # Match "4B" that appears to be a model size (preceded by - or lm-)
+            fallback = re.sub(r'(lm-|-)4B\b', r'\g<1>1.7B', args.lm_model_path)
+            if fallback == args.lm_model_path:
+                # Fallback didn't change anything meaningful, skip downgrade warning
+                fallback = None
acestep/gradio_ui/events/generation_handlers.py (1)

517-548: Minor: Duplicate get_global_gpu_config() call.

get_global_gpu_config() is called at line 450 and again at line 518. Since the GPU config is a singleton that doesn't change during initialization, you could reuse the earlier reference.

This is a very minor optimization and doesn't affect correctness.

docs/zh/GPU_COMPATIBILITY.md (1)

141-149: Add language specifier to fenced code block.

The code block showing the boundary analysis output is missing a language specifier. Since this is plain text output, use text or plaintext to satisfy the markdown linter.

📝 Proposed fix
-```
+```text
 BOUNDARY ANALYSIS
 =================
acestep/gradio_ui/interfaces/generation.py (1)

174-181: Consider disabling the LM checkbox for unsupported tiers.

The info text warns that LM is unavailable for low-VRAM tiers, but the checkbox remains interactive. Consider setting interactive=False when gpu_config.available_lm_models is empty to prevent users from enabling a non-functional feature.

♻️ Proposed enhancement
+                lm_interactive = bool(gpu_config.available_lm_models)
                 init_llm_checkbox = gr.Checkbox(
                     label=t("service.init_llm_label"),
                     value=init_llm_value,
                     info=lm_info_text,
+                    interactive=lm_interactive,
                 )
profile_inference.py (1)

1126-1152: Consider logging exceptions during handler cleanup instead of silent pass.

The try-except-pass pattern silently swallows all exceptions during cleanup. While cleanup should be resilient, logging at DEBUG level helps diagnose issues during development without cluttering normal output.

♻️ Proposed enhancement
+import logging
+
+logger = logging.getLogger(__name__)
+
 def _cleanup_handlers(dit_handler, llm_handler):
     """Clean up handlers and free GPU memory."""
     try:
         if dit_handler is not None:
             if hasattr(dit_handler, 'model') and dit_handler.model is not None:
                 dit_handler.model = None
             if hasattr(dit_handler, 'vae') and dit_handler.vae is not None:
                 dit_handler.vae = None
             if hasattr(dit_handler, 'text_encoder') and dit_handler.text_encoder is not None:
                 dit_handler.text_encoder = None
             del dit_handler
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug("DiT handler cleanup error (non-fatal): %s", e)

     try:
         if llm_handler is not None:
             if hasattr(llm_handler, 'llm') and llm_handler.llm is not None:
                 llm_handler.llm = None
             del llm_handler
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug("LLM handler cleanup error (non-fatal): %s", e)
acestep/handler.py (1)

1581-1586: Remove or use unused use_lm parameter.

use_lm is unused and triggers lint warnings. Either wire it into the estimate (LM overhead) or drop it from the signature.

🛠️ Proposed fix (remove if unused)
-        audio_duration: Optional[float] = None,
-        use_lm: bool = False,
+        audio_duration: Optional[float] = None,
acestep/gpu_config.py (1)

792-805: Ensure adaptive recommended LM is actually available.

compute_adaptive_config picks recommended_lm_model from tier defaults even when the VRAM-budgeted available_lm_models list is smaller. That can recommend a model that doesn’t fit the computed budget. Consider clamping to the largest available model when the tier default isn’t in available_lm_models.

🛠️ Proposed fix
-    return GPUConfig(
+    recommended_model = tier_config.get("recommended_lm_model", "")
+    if recommended_model not in available_lm_models:
+        recommended_model = available_lm_models[-1] if available_lm_models else ""
+    return GPUConfig(
         tier=tier,
         gpu_memory_gb=total_vram_gb,
         max_duration_with_lm=max_dur_lm,
         max_duration_without_lm=max_dur_no_lm,
         max_batch_size_with_lm=max_batch_with_lm,
         max_batch_size_without_lm=max_batch_no_lm,
         init_lm_default=bool(available_lm_models),
         available_lm_models=available_lm_models,
-        recommended_lm_model=tier_config.get("recommended_lm_model", available_lm_models[0] if available_lm_models else ""),
+        recommended_lm_model=recommended_model,
         lm_backend_restriction=tier_config.get("lm_backend_restriction", "all"),
         recommended_backend=tier_config.get("recommended_backend", "vllm"),
         offload_to_cpu_default=tier_config.get("offload_to_cpu_default", True),
         offload_dit_to_cpu_default=tier_config.get("offload_dit_to_cpu_default", True),
         quantization_default=tier_config.get("quantization_default", True),
         compile_model_default=tier_config.get("compile_model_default", True),
         lm_memory_gb=lm_memory_gb,
     )

Comment on lines +1608 to +1617
        # Estimate per-sample activation cost for DiT
        duration_sec = float(audio_duration) if audio_duration and float(audio_duration) > 0 else 60.0
        # Empirical: ~0.8 GB per sample at 60s, linear scaling
        per_sample_gb = 0.8 * (duration_sec / 60.0)
        # If using cfg (base model), double the per-sample cost
        if hasattr(self, 'model') and self.model is not None:
            model_name = getattr(self, 'config_path', '') or ''
            if 'base' in model_name.lower():
                per_sample_gb *= 2.0

⚠️ Potential issue | 🟠 Major

Base-model detection in VRAM guard never triggers.

_vram_guard_reduce_batch checks self.config_path, but initialize_service never sets it. That means base models won’t double the per-sample estimate, so the guard can allow oversized batches and still OOM. Consider using self.config.is_turbo (or storing config_path during init) instead.

🛠️ Proposed fix (use config instead of config_path)
-        if hasattr(self, 'model') and self.model is not None:
-            model_name = getattr(self, 'config_path', '') or ''
-            if 'base' in model_name.lower():
-                per_sample_gb *= 2.0
+        if self.model is not None and self.config is not None:
+            if not getattr(self.config, "is_turbo", False):
+                per_sample_gb *= 2.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Estimate per-sample activation cost for DiT
duration_sec = float(audio_duration) if audio_duration and float(audio_duration) > 0 else 60.0
# Empirical: ~0.8 GB per sample at 60s, linear scaling
per_sample_gb = 0.8 * (duration_sec / 60.0)
# If using cfg (base model), double the per-sample cost
if hasattr(self, 'model') and self.model is not None:
model_name = getattr(self, 'config_path', '') or ''
if 'base' in model_name.lower():
per_sample_gb *= 2.0
# Estimate per-sample activation cost for DiT
duration_sec = float(audio_duration) if audio_duration and float(audio_duration) > 0 else 60.0
# Empirical: ~0.8 GB per sample at 60s, linear scaling
per_sample_gb = 0.8 * (duration_sec / 60.0)
# If using cfg (base model), double the per-sample cost
if self.model is not None and self.config is not None:
if not getattr(self.config, "is_turbo", False):
per_sample_gb *= 2.0
🤖 Prompt for AI Agents
In `@acestep/handler.py` around lines 1608 - 1617, The VRAM guard in
_vram_guard_reduce_batch is checking self.config_path which initialize_service
never sets, so base-model detection never triggers; update the check to use the
existing config object (e.g., self.config) instead—inspect self.config.is_turbo
or other fields on self.config to determine base vs turbo and multiply
per_sample_gb by 2.0 when appropriate; ensure this logic is applied where
per_sample_gb is computed in _vram_guard_reduce_batch and remove or stop relying
on self.config_path, or set self.config_path during initialize_service if you
prefer that pattern.

Comment on lines +3709 to +3718
                    # Check effective free VRAM and auto-enable CPU decode if extremely tight
                    import os as _os
                    _vae_cpu = _os.environ.get("ACESTEP_VAE_ON_CPU", "0").lower() in ("1", "true", "yes")
                    if not _vae_cpu:
                        _effective_free = get_effective_free_vram_gb()
                        logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
                        # If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
                        if _effective_free < 0.5:
                            logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
                            _vae_cpu = True

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the get_effective_free_vram_gb function definition
rg "def get_effective_free_vram_gb" --type py -A 15

Repository: ace-step/ACE-Step-1.5

Length of output: 1003


🏁 Script executed:

# Also search for where this function is imported or used
rg "get_effective_free_vram_gb" --type py | head -20

Repository: ace-step/ACE-Step-1.5

Length of output: 733


🏁 Script executed:

# Check if there are device-specific memory utilities (e.g., for MPS)
rg "mps|MPS" acestep/handler.py -B 2 -A 2 | head -50

Repository: ace-step/ACE-Step-1.5

Length of output: 3029


🏁 Script executed:

# Inspect the context around lines 3709-3718 to understand self.device
sed -n '3680,3730p' acestep/handler.py

Repository: ace-step/ACE-Step-1.5

Length of output: 3186


🏁 Script executed:

# Get the full implementation of get_effective_free_vram_gb
rg "def get_effective_free_vram_gb" --type py -A 50

Repository: ace-step/ACE-Step-1.5

Length of output: 3404


🏁 Script executed:

# Also check if there are any MPS memory utilities in the codebase
rg "torch.backends.mps|torch.mps" --type py | grep -i memory

Repository: ace-step/ACE-Step-1.5

Length of output: 257


Gate VRAM check to CUDA devices only—otherwise MPS/XPU are forced to CPU decode.

get_effective_free_vram_gb() immediately returns 0 when torch.cuda.is_available() is False, so on MPS and XPU devices, _effective_free < 0.5 always evaluates true and unconditionally forces VAE decode to CPU. This degrades performance on systems with only MPS or XPU acceleration.

Gate the memory check to CUDA devices since the function is CUDA-specific:

🛠️ Proposed fix (gate by CUDA)
-                    if not _vae_cpu:
-                        _effective_free = get_effective_free_vram_gb()
-                        logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
-                        # If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
-                        if _effective_free < 0.5:
-                            logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
-                            _vae_cpu = True
+                    is_cuda = self.device == "cuda" or (isinstance(self.device, str) and self.device.startswith("cuda"))
+                    if not _vae_cpu and is_cuda:
+                        _effective_free = get_effective_free_vram_gb()
+                        logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
+                        if _effective_free < 0.5:
+                            logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
+                            _vae_cpu = True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Check effective free VRAM and auto-enable CPU decode if extremely tight
import os as _os
_vae_cpu = _os.environ.get("ACESTEP_VAE_ON_CPU", "0").lower() in ("1", "true", "yes")
if not _vae_cpu:
_effective_free = get_effective_free_vram_gb()
logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
# If less than 0.5 GB free, VAE decode on GPU will almost certainly OOM
if _effective_free < 0.5:
logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
_vae_cpu = True
# Check effective free VRAM and auto-enable CPU decode if extremely tight
import os as _os
_vae_cpu = _os.environ.get("ACESTEP_VAE_ON_CPU", "0").lower() in ("1", "true", "yes")
is_cuda = self.device == "cuda" or (isinstance(self.device, str) and self.device.startswith("cuda"))
if not _vae_cpu and is_cuda:
_effective_free = get_effective_free_vram_gb()
logger.info(f"[generate_music] Effective free VRAM before VAE decode: {_effective_free:.2f} GB")
if _effective_free < 0.5:
logger.warning(f"[generate_music] Only {_effective_free:.2f} GB free VRAM — auto-enabling CPU VAE decode")
_vae_cpu = True
🤖 Prompt for AI Agents
In `@acestep/handler.py` around lines 3709 - 3718, The VRAM auto-check erroneously
runs on non-CUDA backends (MPS/XPU) because get_effective_free_vram_gb() returns
0 when torch.cuda.is_available() is false, forcing VAE decode to CPU; change the
logic in the generate_music VAE decision block to only call
get_effective_free_vram_gb() and apply the _effective_free < 0.5 gate when
torch.cuda.is_available() is true (i.e., wrap the effective-free-VRAM check in a
cuda-available conditional), while preserving the ACESTEP_VAE_ON_CPU env
override and the _vae_cpu variable behavior so only CUDA devices can auto-enable
CPU VAE decode.

Comment on lines +269 to +282
# If KV cache would leave less than 1 GB free, warn and suggest reducing max_model_len
post_kv_free = (free - config.num_kvcache_blocks * block_bytes) / 1024**3
if post_kv_free < 1.0:
print(
f"[nanovllm] WARNING: After KV cache allocation, only {post_kv_free:.2f} GB free. "
f"DiT inference may OOM. Consider reducing max_model_len or using CPU offload."
)

print(
f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
f"{max_tokens_capacity} tokens capacity, {kv_cache_size_gb:.2f} GB "
f"(free: {free / 1024**3:.2f} GB, used: {current / 1024**3:.2f} GB, "
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB)"
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB, "
f"post_kv_free: {post_kv_free:.2f} GB)"

⚠️ Potential issue | 🟡 Minor

Replace the Unicode multiplication sign in the log line.
It triggers RUF001 and can cause copy/paste issues in terminals—use plain x.

✏️ Suggested tweak
-            f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
+            f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks x {self.block_size} tokens = "
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# If KV cache would leave less than 1 GB free, warn and suggest reducing max_model_len
post_kv_free = (free - config.num_kvcache_blocks * block_bytes) / 1024**3
if post_kv_free < 1.0:
print(
f"[nanovllm] WARNING: After KV cache allocation, only {post_kv_free:.2f} GB free. "
f"DiT inference may OOM. Consider reducing max_model_len or using CPU offload."
)
print(
f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks × {self.block_size} tokens = "
f"{max_tokens_capacity} tokens capacity, {kv_cache_size_gb:.2f} GB "
f"(free: {free / 1024**3:.2f} GB, used: {current / 1024**3:.2f} GB, "
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB)"
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB, "
f"post_kv_free: {post_kv_free:.2f} GB)"
# If KV cache would leave less than 1 GB free, warn and suggest reducing max_model_len
post_kv_free = (free - config.num_kvcache_blocks * block_bytes) / 1024**3
if post_kv_free < 1.0:
print(
f"[nanovllm] WARNING: After KV cache allocation, only {post_kv_free:.2f} GB free. "
f"DiT inference may OOM. Consider reducing max_model_len or using CPU offload."
)
print(
f"[nanovllm] KV cache allocated: {config.num_kvcache_blocks} blocks x {self.block_size} tokens = "
f"{max_tokens_capacity} tokens capacity, {kv_cache_size_gb:.2f} GB "
f"(free: {free / 1024**3:.2f} GB, used: {current / 1024**3:.2f} GB, "
f"target: {target_total_usage / 1024**3:.2f} GB, block: {block_bytes / 1024**2:.2f} MB, "
f"post_kv_free: {post_kv_free:.2f} GB)"
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 278-278: String contains ambiguous × (MULTIPLICATION SIGN). Did you mean x (LATIN SMALL LETTER X)?

(RUF001)

🤖 Prompt for AI Agents
In `@acestep/third_parts/nano-vllm/nanovllm/engine/model_runner.py` around lines
269 - 282, The f-string log in model_runner.py uses a Unicode multiplication
sign (×) which triggers RUF001 and can cause copy/paste/terminal issues; update
the print statement that formats KV cache info (the one referencing
config.num_kvcache_blocks, self.block_size, max_tokens_capacity,
kv_cache_size_gb, free, current, target_total_usage, block_bytes, post_kv_free)
to replace the Unicode "×" with a plain ASCII "x" character so the message
becomes e.g. "{config.num_kvcache_blocks} blocks x {self.block_size} tokens =
..." while keeping the rest of the formatting unchanged.

Comment on lines +157 to +160
16-20 GB (enthusiast) Songs up to 10 minutes
▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 1-4 songs at a time
Larger Songwriter brain (1.7B)

⚠️ Potential issue | 🟡 Minor

Batch size lower bound inconsistent with tier progression.

The enthusiast tier (16-20 GB) shows "1-4 songs at a time", but lower tiers (mainstream at 8-12GB and sweet spot at 12-16GB) already support "2-4 songs at a time". The lower bound of 1 for a higher tier doesn't follow a logical progression.

📝 Suggested correction
-    16-20 GB (enthusiast)    Songs up to 10 minutes
-    ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░    1-4 songs at a time
+    16-20 GB (enthusiast)    Songs up to 10 minutes
+    ▓▓▓▓▓▓▓▓▓░░░░░░░░░░░    2-4 songs at a time
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
16-20 GB (enthusiast) Songs up to 10 minutes
▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 1-4 songs at a time
Larger Songwriter brain (1.7B)
16-20 GB (enthusiast) Songs up to 10 minutes
▓▓▓▓▓▓▓▓▓░░░░░░░░░░░ 2-4 songs at a time
Larger Songwriter brain (1.7B)
🤖 Prompt for AI Agents
In `@docs/en/ace_step_musicians_guide.md` around lines 157 - 160, Update the
enthusiast tier entry so its batch-size range follows the tier progression:
locate the line containing "16-20 GB (enthusiast)" and the phrase "1-4 songs at
a time" and change it to "2-4 songs at a time" (keeping the rest of the text,
e.g., "Songs up to 10 minutes" and "Larger Songwriter brain (1.7B)" unchanged)
so the lower bound is consistent with the 8-12GB and 12-16GB tiers.

Comment on lines +160 to +223
### 6. `tier-test` — Automated GPU Tier Testing

Automatically simulates different GPU VRAM sizes using `MAX_CUDA_VRAM` and runs a generation test at each tier. This is the recommended way to validate that all GPU tiers work correctly after modifying `acestep/gpu_config.py`.

```bash
# Test all tiers (4, 6, 8, 12, 16, 20, 24 GB)
python profile_inference.py --mode tier-test

# Test specific VRAM sizes
python profile_inference.py --mode tier-test --tiers 6 8 16

# Test with LM enabled (where the tier supports it)
python profile_inference.py --mode tier-test --tier-with-lm

# Quick test: skip torch.compile for non-quantized tiers
python profile_inference.py --mode tier-test --tier-skip-compile
```

**What it validates per tier:**
- Correct tier detection and `GPUConfig` construction
- Model initialization (DiT, VAE, Text Encoder, optionally LM)
- A short generation run (30s duration, batch=1) completes without OOM
- Adaptive VAE decode fallback (GPU → CPU offload → full CPU)
- VRAM usage stays within the simulated limit

**Output example:**

```
TIER TEST RESULTS
====================================================================================================
VRAM Tier LM Duration Status Peak VRAM Notes
──────────────────────────────────────────────────────────────────────────────
4GB tier1 — 30s ✅ OK 3.8GB VAE decoded on CPU
6GB tier2 — 30s ✅ OK 5.4GB Tiled VAE chunk=256
8GB tier4 0.6B 30s ✅ OK 7.2GB vllm backend
12GB tier5 1.7B 30s ✅ OK 10.8GB vllm backend
16GB tier6a 1.7B 30s ✅ OK 14.5GB offload enabled
20GB tier6b 1.7B 30s ✅ OK 17.2GB no offload
24GB unlimited 4B 30s ✅ OK 21.3GB full models on GPU
```

> **Note**: `tier-test` mode uses `torch.cuda.set_per_process_memory_fraction()` to enforce a hard VRAM cap, making simulations realistic even on high-end GPUs (e.g., A100 80GB).

#### Boundary Testing

Use `--tier-boundary` to find the minimum VRAM tier at which INT8 quantization and CPU offload can be safely disabled. For each tier, up to three configurations are tested:

1. **default** — tier's standard settings
2. **no-quant** — quantization disabled, offload unchanged
3. **no-offload** — no quantization AND no CPU offload

```bash
# Run boundary tests across all tiers
python profile_inference.py --mode tier-test --tier-boundary

# Boundary test with LM enabled
python profile_inference.py --mode tier-test --tier-boundary --tier-with-lm

# Save boundary results to JSON
python profile_inference.py --mode tier-test --tier-boundary --benchmark-output boundary_results.json
```

The output includes a **Boundary Analysis** summary showing the minimum tier for each capability.

⚠️ Potential issue | 🟡 Minor

Add a language to the tier-test output code fence.
This fixes MD040 and keeps Markdown lint clean.

✏️ Suggested fix
-```
+```text
 TIER TEST RESULTS
 ====================================================================================================
 ...
-```
+```
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 187-187: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/en/BENCHMARK.md` around lines 160 - 223, The sample output code fence
under the "tier-test" section (the "TIER TEST RESULTS" block) lacks a language
tag; update the opening fence from ``` to ```text to satisfy MD040 and Markdown
linting, leaving the fence contents and closing ``` unchanged so the block is
explicitly marked as plain text.

print(f" max_batch_without_lm: {gpu_config.max_batch_size_without_lm}")

# ---- Test 1: Default configuration ----
print(f"\n --- Variant: default ---")

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-string prefix.

This string has no placeholders but uses an f-string prefix.

🔧 Proposed fix
-        print(f"\n  --- Variant: default ---")
+        print("\n  --- Variant: default ---")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
print(f"\n --- Variant: default ---")
print("\n --- Variant: default ---")
🧰 Tools
🪛 Ruff (0.14.14)

[error] 1072-1072: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@profile_inference.py` at line 1072, The print call using an unnecessary
f-string should be changed to a regular string: locate the statement print(f"\n 
--- Variant: default ---") in profile_inference.py and remove the leading f so
it becomes print("\n  --- Variant: default ---"); no other behavior changes are
needed.

Comment on lines 1092 to 1109
            else:
                print(f"\n  --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")

            # ---- Test 3: No quantization AND no offload ----
            # Skip if the tier already has both disabled
            if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
                print(f"\n  --- Variant: no-offload (quant=None, offload=False) ---")
                result_no_offload = _run_single_tier_test(
                    sim_gb, gpu_config, args, example_data,
                    checkpoint_dir, disk_lm_models,
                    offload_override=False,
                    offload_dit_override=False,
                    quantization_override=None,
                    test_variant="no-offload",
                )
                all_results.append(result_no_offload)
            else:
                print(f"\n  --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-string prefixes.

Several print statements use f-strings without any placeholders.

🔧 Proposed fix
             else:
-                print(f"\n  --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
+                print("\n  --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")

             # ---- Test 3: No quantization AND no offload ----
             # Skip if the tier already has both disabled
             if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
-                print(f"\n  --- Variant: no-offload (quant=None, offload=False) ---")
+                print("\n  --- Variant: no-offload (quant=None, offload=False) ---")
                 result_no_offload = _run_single_tier_test(
                     ...
                 )
                 all_results.append(result_no_offload)
             else:
-                print(f"\n  --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
+                print("\n  --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
else:
print(f"\n --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
# ---- Test 3: No quantization AND no offload ----
# Skip if the tier already has both disabled
if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
print(f"\n --- Variant: no-offload (quant=None, offload=False) ---")
result_no_offload = _run_single_tier_test(
sim_gb, gpu_config, args, example_data,
checkpoint_dir, disk_lm_models,
offload_override=False,
offload_dit_override=False,
quantization_override=None,
test_variant="no-offload",
)
all_results.append(result_no_offload)
else:
print(f"\n --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
else:
print("\n --- Variant: no-quant — SKIPPED (tier already has quantization=False) ---")
# ---- Test 3: No quantization AND no offload ----
# Skip if the tier already has both disabled
if gpu_config.quantization_default or gpu_config.offload_to_cpu_default:
print("\n --- Variant: no-offload (quant=None, offload=False) ---")
result_no_offload = _run_single_tier_test(
sim_gb, gpu_config, args, example_data,
checkpoint_dir, disk_lm_models,
offload_override=False,
offload_dit_override=False,
quantization_override=None,
test_variant="no-offload",
)
all_results.append(result_no_offload)
else:
print("\n --- Variant: no-offload — SKIPPED (tier already has offload=False, quant=False) ---")
🧰 Tools
🪛 Ruff (0.14.14)

[error] 1093-1093: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 1098-1098: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 1109-1109: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 1092 - 1109, Fix the unnecessary f-string
prefixes on print statements that have no interpolations: replace print(f"...")
with print("...") for the messages around the "no-quant" and "no-offload"
variants in the block that calls _run_single_tier_test; specifically update the
print calls that reference the no-quant and no-offload messages which use
gpu_config.quantization_default and gpu_config.offload_to_cpu_default to
determine skips so they are ordinary string literals instead of f-strings.

Comment on lines +1276 to +1286
            if failing:
                print(f"  {capability_name}:")
                print(f"    ❌ No tier passed this test. All tested tiers failed.")
                for r in failing:
                    err = (r.get("error") or "unknown")[:50]
                    print(f"       {r['tier_gb']}GB ({r['tier']}): {err}")
            else:
                print(f"  {capability_name}:")
                print(f"    ⚠️ No test results available for this capability.")
            print()
            return passing[0] if passing else None

⚠️ Potential issue | 🟡 Minor

Remove extraneous f-string prefixes in boundary summary.

Two print statements use f-strings without placeholders.

🔧 Proposed fix
             if failing:
-                print(f"  {capability_name}:")
-                print(f"    ❌ No tier passed this test. All tested tiers failed.")
+                print(f"  {capability_name}:")  # This one is fine, has placeholder
+                print("    ❌ No tier passed this test. All tested tiers failed.")
                 for r in failing:
                     err = (r.get("error") or "unknown")[:50]
                     print(f"       {r['tier_gb']}GB ({r['tier']}): {err}")
             else:
                 print(f"  {capability_name}:")
-                print(f"    ⚠️ No test results available for this capability.")
+                print("    ⚠️ No test results available for this capability.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if failing:
print(f" {capability_name}:")
print(f" ❌ No tier passed this test. All tested tiers failed.")
for r in failing:
err = (r.get("error") or "unknown")[:50]
print(f" {r['tier_gb']}GB ({r['tier']}): {err}")
else:
print(f" {capability_name}:")
print(f" ⚠️ No test results available for this capability.")
print()
return passing[0] if passing else None
if failing:
print(f" {capability_name}:") # This one is fine, has placeholder
print(" ❌ No tier passed this test. All tested tiers failed.")
for r in failing:
err = (r.get("error") or "unknown")[:50]
print(f" {r['tier_gb']}GB ({r['tier']}): {err}")
else:
print(f" {capability_name}:")
print(" ⚠️ No test results available for this capability.")
print()
return passing[0] if passing else None
🧰 Tools
🪛 Ruff (0.14.14)

[error] 1278-1278: f-string without any placeholders

Remove extraneous f prefix

(F541)


[error] 1284-1284: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 1276 - 1286, The two print calls that use
f-strings without placeholders should be regular strings: in the block that
references capability_name, failing, and passing (the one that prints the
boundary summary and returns passing[0] if any), replace print(f"    ❌ No tier
passed this test. All tested tiers failed.") and print(f"    ⚠️ No test results
available for this capability.") with print("    ❌ No tier passed this test. All
tested tiers failed.") and print("    ⚠️ No test results available for this
capability.") respectively so the unnecessary f-string prefixes are removed.

Comment on lines +165 to +183
    try:
        with torch.inference_mode():
            # Simulate DiT inference inputs
            # Create dummy latent noise
            noise = torch.randn(batch_size, latent_length, latent_dim, device="cuda", dtype=torch.bfloat16)

            # Simulate text encoder output
            text_hidden = torch.randn(batch_size, 512, 768, device="cuda", dtype=torch.bfloat16)
            text_mask = torch.ones(batch_size, 512, device="cuda", dtype=torch.long)

            # If has CFG, double the batch for classifier-free guidance
            if has_cfg:
                noise_cfg = torch.cat([noise, noise], dim=0)
                text_hidden_cfg = torch.cat([text_hidden, text_hidden], dim=0)
                text_mask_cfg = torch.cat([text_mask, text_mask], dim=0)
                del noise_cfg, text_hidden_cfg, text_mask_cfg

            del noise, text_hidden, text_mask
            torch.cuda.synchronize()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file and understand its context
wc -l scripts/profile_vram.py

Repository: ace-step/ACE-Step-1.5

Length of output: 92


🏁 Script executed:

# Let's read the file to understand the full context around lines 165-183
cat -n scripts/profile_vram.py | sed -n '150,200p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2780


🏁 Script executed:

# Let's also see the beginning of the function that contains this code
cat -n scripts/profile_vram.py | sed -n '1,50p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2044


🏁 Script executed:

# Let's see the function that contains this code - find where it starts
cat -n scripts/profile_vram.py | sed -n '100,165p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2778


🏁 Script executed:

# Search for model.forward or model( calls in the file
rg -n "model\.|\.forward\(|\.generate\(" scripts/profile_vram.py

Repository: ace-step/ACE-Step-1.5

Length of output: 347


🏁 Script executed:

# Let's see the complete try-except block to verify if model is called anywhere
cat -n scripts/profile_vram.py | sed -n '165,210p'

Repository: ace-step/ACE-Step-1.5

Length of output: 2706


DiT "inference" profiling doesn't run the model.

The loop only allocates dummy tensors (lines 169–182) and deletes them; it never executes a forward pass. The peak memory measurement (line 186) captures only input tensor allocation, excluding all activation memory from the actual model computation. This will severely under-report peak VRAM usage and mis-calibrate GPU tier selection. Either run a minimal forward pass with model(noise, text_hidden, text_mask) to include activations, or rename peak_inference_gb to peak_input_allocation_gb to clarify the scope.

🤖 Prompt for AI Agents
In `@scripts/profile_vram.py` around lines 165 - 183, The current DiT profiling
only allocates and deletes dummy tensors (noise, text_hidden, text_mask) and
never executes the model, so peak memory misses activation usage; replace the
no-op block with a minimal forward pass by calling the DiT model (e.g.,
model(noise, text_hidden, text_mask) or model.forward(...)) inside the
torch.inference_mode() context so activations are allocated and measured, and
when has_cfg is true duplicate the inputs (noise_cfg, text_hidden_cfg,
text_mask_cfg) and pass the doubled batch to the model to simulate
classifier-free guidance; alternatively, if you intentionally only want to
measure input allocation, rename peak_inference_gb to peak_input_allocation_gb
to reflect the narrower measurement.

Comment on lines +318 to +321
    encoder_path = os.path.join(checkpoint_dir, "text_encoder")
    if not os.path.exists(encoder_path):
        print(f"  Text encoder not found: {encoder_path}")
        return {}

⚠️ Potential issue | 🟠 Major

Text encoder path doesn’t match runtime checkpoints.

The handler loads the encoder from Qwen3-Embedding-0.6B, but this script looks for text_encoder, so profiling will likely skip it. Align the path or add a fallback.

🛠️ Proposed fix (use runtime path with fallback)
-    encoder_path = os.path.join(checkpoint_dir, "text_encoder")
-    if not os.path.exists(encoder_path):
+    encoder_path = os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")
+    if not os.path.exists(encoder_path):
+        encoder_path = os.path.join(checkpoint_dir, "text_encoder")
+    if not os.path.exists(encoder_path):
         print(f"  Text encoder not found: {encoder_path}")
         return {}
🤖 Prompt for AI Agents
In `@scripts/profile_vram.py` around lines 318 - 321, The code builds encoder_path
= os.path.join(checkpoint_dir, "text_encoder") which doesn't match the runtime
checkpoint name (e.g., "Qwen3-Embedding-0.6B"), so profiling can skip the
encoder; update the logic in scripts/profile_vram.py around encoder_path to try
the runtime checkpoint name as a fallback (check for
os.path.exists(os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")) if the
"text_encoder" path is missing) and only return {} after both attempts fail, or
prefer the runtime-named directory when present; ensure references to
encoder_path, checkpoint_dir and the literal names ("text_encoder",
"Qwen3-Embedding-0.6B") are used so the handler and this script align.

chuxij added 2 commits February 10, 2026 13:06
Root cause: tier6a (16-20GB) had max_batch_size_with_lm=1, which was
overly conservative. Empirical testing on 16GB (simulated) showed:
- Without LM: batch=4 uses 13.3GB, batch=7 uses 13.4GB (all fit in 16GB)
- With LM (1.7B): batch=2 uses 11.9GB, batch=4 fits within 16GB budget

Changes:
- tier6a: max_batch_size_with_lm 1→4, max_batch_size_without_lm 4→8
- tier6b: max_batch_size_with_lm 2→4 (20-24GB has ample headroom)
- Added --tier-batch-boundary flag to profile_inference.py for automated
  batch size boundary testing (escalates 1,2,4,8 with LM and without LM)
- Added GPU tier config patching during batch tests to bypass inference.py
  batch clamping
- Updated GPU_COMPATIBILITY docs (en/zh/ja/ko) and BENCHMARK docs (en/zh)
  with corrected batch limits and new batch boundary testing instructions
- Updated tests to match new batch size expectations
…bled

Root cause: _vram_guard_reduce_batch checks free VRAM *before* DiT runs,
but at that point the vllm LM model (weights + KV cache) is still on GPU.
On a 16GB GPU with 1.7B LM loaded, only ~7.6GB appears free, causing the
guard to slash batch_size from 4 to 1 — even though the LM will be
offloaded before DiT actually needs the memory.

Fix: When offload_to_cpu=True, trust the static GPU tier config limits
(which were empirically validated with offload enabled) instead of the
misleading instantaneous free VRAM reading. If batch_size <= tier's
max_batch_size_with_lm, skip the dynamic VRAM check entirely.

This fixes the bug where users with 16GB GPUs saw LM generate 4 audio
codes but DiT only produced 1 output.
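A rough sketch of the guard behavior this commit describes. gpu_config.max_batch_size_with_lm comes from acestep/gpu_config.py; the function shape and the per_sample_gb estimate are illustrative assumptions, not the actual handler code.

```python
def vram_guard_reduce_batch(batch_size, gpu_config, offload_to_cpu, free_vram_gb,
                            per_sample_gb=2.0):  # per_sample_gb is an assumed estimate
    # With offload enabled, the LM leaves the GPU before DiT runs, so the
    # instantaneous free-VRAM reading is misleading; trust the validated tier limit.
    if offload_to_cpu and batch_size <= gpu_config.max_batch_size_with_lm:
        return batch_size
    # Otherwise fall back to a dynamic clamp based on currently free VRAM.
    affordable = max(1, int(free_vram_gb // per_sample_gb))
    return min(batch_size, affordable)
```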
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@docs/en/GPU_COMPATIBILITY.md`:
- Around line 141-149: The fenced code block showing the "BOUNDARY ANALYSIS"
table lacks a language specifier; update the opening fence that precedes the
BOUNDARY ANALYSIS block from ``` to ```text so the block is explicitly marked as
plain text (look for the lines containing the literal "BOUNDARY ANALYSIS" and
the surrounding triple-backtick fence and change the opening fence accordingly).
- Around line 79-83: Update the tier labels for the GPU simulation examples:
change the comment "Simulate an 8GB GPU (Tier 4)" to "Simulate an 8GB GPU (Tier
3)" and change "Simulate a 12GB GPU (Tier 5)" to "Simulate a 12GB GPU (Tier 4)"
for the lines with MAX_CUDA_VRAM=8 uv run acestep and MAX_CUDA_VRAM=12 uv run
acestep so the examples match the mapping (≤8GB → tier3, ≤12GB → tier4).

In `@docs/ja/GPU_COMPATIBILITY.md`:
- Around line 79-83: Update the comment labels for the GPU simulation examples:
change the "8GB GPU (Tier 4) をシミュレート" comment to "8GB GPU (Tier 3) をシミュレート" and
change the "12GB GPU (Tier 5) をシミュレート" comment to "12GB GPU (Tier 4) をシミュレート" so
the comments match the tier mapping for the MAX_CUDA_VRAM examples (the lines
using MAX_CUDA_VRAM=8 and MAX_CUDA_VRAM=12 before running "uv run acestep").

In `@docs/ko/GPU_COMPATIBILITY.md`:
- Around line 79-83: The tier labels are incorrect for the 8GB/12GB examples;
update the headings for the examples shown (the commented lines above the
commands using MAX_CUDA_VRAM and uv run acestep) so the 8GB example reads "8GB
GPU 시뮬레이션 (티어 3)" and the 12GB example reads "12GB GPU 시뮬레이션 (티어 4)" to match
the mapping ≤8GB → tier3 and ≤12GB → tier4.

In `@docs/zh/BENCHMARK.md`:
- Around line 165-166: Update the "测试所有等级" section so the listed tiers match the
actual default tiers used by profile_inference.py; replace the current list
(which shows 20GB and omits 48GB) with the real defaults (include 48GB and
remove 20GB), or add a short note stating that 20GB is not a default and must be
provided via the --tiers flag; refer to the "python profile_inference.py --mode
tier-test" invocation and the default tier list in the script when making the
change.

In `@docs/zh/GPU_COMPATIBILITY.md`:
- Around line 141-149: The fenced code block that starts with the "BOUNDARY
ANALYSIS" header should include a language specifier to ensure proper rendering;
update the opening triple-backtick for the block containing "BOUNDARY ANALYSIS"
and the table (the block that currently begins with ``` and the header line
"BOUNDARY ANALYSIS") to use a language tag (e.g., change ``` to ```text) so the
table renders as plain text in documentation.
- Around line 79-83: Update the tier labels in the examples: change the heading
"模拟 8GB GPU (Tier 4)" to "模拟 8GB GPU (Tier 3)" and change "模拟 12GB GPU (Tier 5)"
to "模拟 12GB GPU (Tier 4)"; keep the example commands (MAX_CUDA_VRAM=8 uv run
acestep and MAX_CUDA_VRAM=12 uv run acestep) unchanged and ensure the
documentation reflects the mapping ≤8GB → tier3 and ≤12GB → tier4.

In `@profile_inference.py`:
- Around line 789-798: The code may pick a disk-only LM that is too large for
the current tier because disk_lm_models are not filtered by tier size; before
calling find_best_lm_model_on_disk, filter disk_lm_models to only include models
whose size is compatible with the current tier (use the tier variable and
gpu_config-recommended sizing rules) and then pass that filtered list to
find_best_lm_model_on_disk (keep references to lm_model, use_lm,
find_best_lm_model_on_disk, disk_lm_models, gpu_config.recommended_lm_model and
tier so the change is easy to locate).
- Around line 815-823: The current CUDA memory-fraction logic only sets a
reduced fraction when sim_gb < physical VRAM, but doesn't reset the per-process
cap when sim_gb >= physical VRAM, leaving a prior smaller cap in place; update
the block in profile_inference.py that checks torch.cuda.is_available() (the
code using torch.cuda.get_device_properties, total_gb, sim_gb and
torch.cuda.set_per_process_memory_fraction) so that when sim_gb >= total_gb you
explicitly call torch.cuda.set_per_process_memory_fraction(1.0) to clear any
previous cap; retain the existing reduced-fraction calculation path for sim_gb <
total_gb.

Comment on lines 79 to 83
# Simulate an 8GB GPU (Tier 4)
MAX_CUDA_VRAM=8 uv run acestep

# Simulate a 12GB GPU (Tier 5)
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# Simulate an 8GB GPU (Tier 4)
+# Simulate an 8GB GPU (Tier 3)

-# Simulate a 12GB GPU (Tier 5)
+# Simulate a 12GB GPU (Tier 4)
🤖 Prompt for AI Agents
In `@docs/en/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the tier labels
for the GPU simulation examples: change the comment "Simulate an 8GB GPU (Tier
4)" to "Simulate an 8GB GPU (Tier 3)" and change "Simulate a 12GB GPU (Tier 5)"
to "Simulate a 12GB GPU (Tier 4)" for the lines with MAX_CUDA_VRAM=8 uv run
acestep and MAX_CUDA_VRAM=12 uv run acestep so the examples match the mapping
(≤8GB → tier3, ≤12GB → tier4).
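For context, the mapping the reviewer applies (≤8GB → tier3, ≤12GB → tier4) corresponds to a simple threshold lookup along these lines. This is only a sketch assembled from the tier ranges quoted in this PR; the authoritative logic lives in acestep/gpu_config.py and its exact boundaries may differ.

```python
def detect_tier(vram_gb: float) -> str:
    """Sketch of the VRAM-to-tier selection implied by the review comments."""
    thresholds = [
        (4, "tier1"), (6, "tier2"), (8, "tier3"), (12, "tier4"),
        (16, "tier5"), (20, "tier6a"), (24, "tier6b"),
    ]
    for limit, tier in thresholds:
        if vram_gb <= limit:
            return tier
    return "unlimited"

assert detect_tier(8) == "tier3"   # MAX_CUDA_VRAM=8  → Tier 3
assert detect_tier(12) == "tier4"  # MAX_CUDA_VRAM=12 → Tier 4
```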

Comment on lines +141 to +149
```
BOUNDARY ANALYSIS
=================
  Capability                                    Min Tier   VRAM
  ------------------------------------------------------------
  No INT8 Quantization                          tier6b     20GB
  No CPU Offload (all models on GPU)            tier6b     20GB
  ------------------------------------------------------------
```

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the boundary-analysis output block.

📝 Suggested update
-```
+```text
 BOUNDARY ANALYSIS
 =================
   Capability                                    Min Tier   VRAM
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 141-141: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In `@docs/en/GPU_COMPATIBILITY.md` around lines 141 - 149, The fenced code block
showing the "BOUNDARY ANALYSIS" table lacks a language specifier; update the
opening fence that precedes the BOUNDARY ANALYSIS block from ``` to ```text so
the block is explicitly marked as plain text (look for the lines containing the
literal "BOUNDARY ANALYSIS" and the surrounding triple-backtick fence and change
the opening fence accordingly).

Comment on lines 79 to 83
# 8GB GPU (Tier 4) をシミュレート
MAX_CUDA_VRAM=8 uv run acestep

# 12GB GPU (Tier 5) をシミュレート
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# 8GB GPU (Tier 4) をシミュレート
+# 8GB GPU (Tier 3) をシミュレート

-# 12GB GPU (Tier 5) をシミュレート
+# 12GB GPU (Tier 4) をシミュレート
🤖 Prompt for AI Agents
In `@docs/ja/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the comment labels
for the GPU simulation examples: change the "8GB GPU (Tier 4) をシミュレート" comment
to "8GB GPU (Tier 3) をシミュレート" and change the "12GB GPU (Tier 5) をシミュレート" comment
to "12GB GPU (Tier 4) をシミュレート" so the comments match the tier mapping for the
MAX_CUDA_VRAM examples (the lines using MAX_CUDA_VRAM=8 and MAX_CUDA_VRAM=12
before running "uv run acestep").

Comment on lines 79 to 83
# 8GB GPU 시뮬레이션 (티어 4)
MAX_CUDA_VRAM=8 uv run acestep

# 12GB GPU 시뮬레이션 (티어 5)
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# 8GB GPU 시뮬레이션 (티어 4)
+# 8GB GPU 시뮬레이션 (티어 3)

-# 12GB GPU 시뮬레이션 (티어 5)
+# 12GB GPU 시뮬레이션 (티어 4)
🤖 Prompt for AI Agents
In `@docs/ko/GPU_COMPATIBILITY.md` around lines 79 - 83, The tier labels are
incorrect for the 8GB/12GB examples; update the headings for the examples shown
(the commented lines above the commands using MAX_CUDA_VRAM and uv run acestep)
so the 8GB example reads "8GB GPU 시뮬레이션 (티어 3)" and the 12GB example reads "12GB
GPU 시뮬레이션 (티어 4)" to match the mapping ≤8GB → tier3 and ≤12GB → tier4.

Comment on lines +165 to +166
# 测试所有等级 (4, 6, 8, 12, 16, 20, 24 GB)
python profile_inference.py --mode tier-test

⚠️ Potential issue | 🟡 Minor

Align the “test all tiers” list with actual tier-test defaults.

The default tier list in code includes 48GB and does not include 20GB; the doc should match or explicitly note that 20GB is only included via --tiers.

📝 Suggested update
-# 测试所有等级 (4, 6, 8, 12, 16, 20, 24 GB)
+# 测试所有等级 (4, 6, 8, 12, 16, 24, 48 GB)
🤖 Prompt for AI Agents
In `@docs/zh/BENCHMARK.md` around lines 165 - 166, Update the "测试所有等级" section so
the listed tiers match the actual default tiers used by profile_inference.py;
replace the current list (which shows 20GB and omits 48GB) with the real
defaults (include 48GB and remove 20GB), or add a short note stating that 20GB
is not a default and must be provided via the --tiers flag; refer to the "python
profile_inference.py --mode tier-test" invocation and the default tier list in
the script when making the change.

Comment on lines 79 to 83
# 模拟 8GB GPU (Tier 4)
MAX_CUDA_VRAM=8 uv run acestep

# 模拟 12GB GPU (Tier 5)
MAX_CUDA_VRAM=12 uv run acestep

⚠️ Potential issue | 🟡 Minor

Correct tier labels for 8GB/12GB simulation examples.

Tier mapping is ≤8GB → tier3 and ≤12GB → tier4.

📝 Suggested update
-# 模拟 8GB GPU (Tier 4)
+# 模拟 8GB GPU (Tier 3)

-# 模拟 12GB GPU (Tier 5)
+# 模拟 12GB GPU (Tier 4)
🤖 Prompt for AI Agents
In `@docs/zh/GPU_COMPATIBILITY.md` around lines 79 - 83, Update the tier labels in
the examples: change the heading "模拟 8GB GPU (Tier 4)" to "模拟 8GB GPU (Tier 3)"
and change "模拟 12GB GPU (Tier 5)" to "模拟 12GB GPU (Tier 4)"; keep the example
commands (MAX_CUDA_VRAM=8 uv run acestep and MAX_CUDA_VRAM=12 uv run acestep)
unchanged and ensure the documentation reflects the mapping ≤8GB → tier3 and
≤12GB → tier4.

Comment on lines +141 to +149
```
BOUNDARY ANALYSIS
=================
  Capability                                    Min Tier   VRAM
  ------------------------------------------------------------
  No INT8 Quantization                          tier6b     20GB
  No CPU Offload (all models on GPU)            tier6b     20GB
  ------------------------------------------------------------
```

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the boundary-analysis output block.

📝 Suggested update
-```
+```text
 BOUNDARY ANALYSIS
 =================
   Capability                                    Min Tier   VRAM
🤖 Prompt for AI Agents
In `@docs/zh/GPU_COMPATIBILITY.md` around lines 141 - 149, The fenced code block
that starts with the "BOUNDARY ANALYSIS" header should include a language
specifier to ensure proper rendering; update the opening triple-backtick for the
block containing "BOUNDARY ANALYSIS" and the table (the block that currently
begins with ``` and the header line "BOUNDARY ANALYSIS") to use a language tag
(e.g., change ``` to ```text) so the table renders as plain text in
documentation.

Comment on lines +789 to +798
    # Find LM model on disk
    lm_model = None
    lm_backend = gpu_config.recommended_backend
    if use_lm:
        lm_model = find_best_lm_model_on_disk(
            gpu_config.recommended_lm_model, disk_lm_models
        )
        if not lm_model:
            print(f"  ⚠️ No compatible LM model on disk for tier {tier}, skipping LM")
            use_lm = False

⚠️ Potential issue | 🟠 Major

Filter disk LM models by tier-compatible sizes before selection.

Without size filtering, a larger disk-only LM can be chosen and cause false failures on low tiers.

🛠️ Suggested fix
     if use_lm:
-        lm_model = find_best_lm_model_on_disk(
-            gpu_config.recommended_lm_model, disk_lm_models
-        )
+        compatible_models = [
+            m for m in disk_lm_models
+            if is_lm_model_size_allowed(m, gpu_config.available_lm_models)
+        ]
+        lm_model = find_best_lm_model_on_disk(
+            gpu_config.recommended_lm_model, compatible_models
+        )
         if not lm_model:
             print(f"  ⚠️ No compatible LM model on disk for tier {tier}, skipping LM")
             use_lm = False
🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 789 - 798, The code may pick a disk-only
LM that is too large for the current tier because disk_lm_models are not
filtered by tier size; before calling find_best_lm_model_on_disk, filter
disk_lm_models to only include models whose size is compatible with the current
tier (use the tier variable and gpu_config-recommended sizing rules) and then
pass that filtered list to find_best_lm_model_on_disk (keep references to
lm_model, use_lm, find_best_lm_model_on_disk, disk_lm_models,
gpu_config.recommended_lm_model and tier so the change is easy to locate).

Comment on lines +815 to +823
    if torch.cuda.is_available():
        total_bytes = torch.cuda.get_device_properties(0).total_memory
        total_gb = total_bytes / (1024 ** 3)
        if sim_gb < total_gb:
            reference_context_gb = 0.5
            allocator_budget_gb = max(0.5, sim_gb - reference_context_gb)
            fraction = max(0.01, min(1.0, allocator_budget_gb / total_gb))
            torch.cuda.set_per_process_memory_fraction(fraction)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "profile_inference.py" -type f

Repository: ace-step/ACE-Step-1.5

Length of output: 87


🏁 Script executed:

# Get file size first
wc -l profile_inference.py 2>/dev/null || echo "File not found, searching..."
find . -name "*.py" -path "*profile*" -type f

Repository: ace-step/ACE-Step-1.5

Length of output: 139


🏁 Script executed:

# Read the specific lines with broader context (lines 800-835)
sed -n '800,835p' profile_inference.py | cat -n

Repository: ace-step/ACE-Step-1.5

Length of output: 1767


🏁 Script executed:

# Search for tier-test related code and loop structures
rg -n "tier" profile_inference.py -i -A 2 -B 2

Repository: ace-step/ACE-Step-1.5

Length of output: 22119


🏁 Script executed:

# Look for where this function is called
rg -n "torch.cuda.set_per_process_memory_fraction" . --type py -B 5 -A 5

Repository: ace-step/ACE-Step-1.5

Length of output: 2768


🏁 Script executed:

# Check PyTorch documentation or source for set_per_process_memory_fraction behavior
rg -n "set_per_process_memory_fraction\|memory_fraction" . --type py -B 2 -A 2

Repository: ace-step/ACE-Step-1.5

Length of output: 47


🏁 Script executed:

# Look at the loop structure more carefully - check if tiers are tested in order and if multiple variants are called per tier
sed -n '1075,1190p' profile_inference.py | cat -n

Repository: ace-step/ACE-Step-1.5

Length of output: 6728


🏁 Script executed:

# Check if there's any initialization or reset code before/after the tier loop
sed -n '1019,1100p' profile_inference.py | cat -n

Repository: ace-step/ACE-Step-1.5

Length of output: 4212


🏁 Script executed:

# Search for any documentation or comments about the memory fraction behavior
rg -n "memory.fraction\|cuda.*fraction\|1\.0\|reset" profile_inference.py -i | head -30

Repository: ace-step/ACE-Step-1.5

Length of output: 47


🏁 Script executed:

# Check if there's any documentation or code comments about memory fraction in the codebase
rg -n "1\.0.*fraction\|fraction.*1\.0\|unrestricted\|reset.*cuda\|cuda.*reset" . --type py -i

Repository: ace-step/ACE-Step-1.5

Length of output: 47


🏁 Script executed:

# Look for any tests or examples of set_per_process_memory_fraction to understand typical usage
rg -n "set_per_process_memory_fraction" . --type py -B 3 -A 3

Repository: ace-step/ACE-Step-1.5

Length of output: 3654


Reset CUDA memory fraction to 1.0 when sim_gb ≥ physical GPU VRAM.

Within each tier, multiple test variants (default, no-quant, no-offload, batch-N) are run sequentially. A smaller fraction set by an earlier variant persists and caps later variants, causing false failures or skewed VRAM measurements. Always reset to 1.0 when the simulated VRAM is at or above physical capacity to clear the prior cap.

🛠️ Suggested fix
         if sim_gb < total_gb:
             reference_context_gb = 0.5
             allocator_budget_gb = max(0.5, sim_gb - reference_context_gb)
             fraction = max(0.01, min(1.0, allocator_budget_gb / total_gb))
             torch.cuda.set_per_process_memory_fraction(fraction)
+        else:
+            # Ensure we don't keep a tighter cap from a previous tier or variant
+            torch.cuda.set_per_process_memory_fraction(1.0)
🤖 Prompt for AI Agents
In `@profile_inference.py` around lines 815 - 823, The current CUDA
memory-fraction logic only sets a reduced fraction when sim_gb < physical VRAM,
but doesn't reset the per-process cap when sim_gb >= physical VRAM, leaving a
prior smaller cap in place; update the block in profile_inference.py that checks
torch.cuda.is_available() (the code using torch.cuda.get_device_properties,
total_gb, sim_gb and torch.cuda.set_per_process_memory_fraction) so that when
sim_gb >= total_gb you explicitly call
torch.cuda.set_per_process_memory_fraction(1.0) to clear any previous cap;
retain the existing reduced-fraction calculation path for sim_gb < total_gb.

8GB GPUs with 0.6B LM + offload have enough headroom for batch=2.
DiT(4.46) + context(0.5) ≈ 5.0GB leaves ~3GB free, sufficient for
2 samples of DiT activations (~0.8GB each at 60s).

Updated gpu_config.py and docs (en/zh/ja).
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@acestep/handler.py`:
- Around line 1605-1607: The device check currently compares the device variable
to strings and can miss non-CUDA devices (e.g., torch.device objects like
"xpu"), causing CUDA-only VRAM logic to run incorrectly; update the check in the
handler where device is read (the block using device = self.device and the
similar block at the later occurrence) to reliably detect CUDA by converting to
string or using device.type (e.g., if str(device) in ("cpu","mps") or if
getattr(device,"type",str(device)) in ("cpu","mps")), and keep the subsequent
call to get_effective_free_vram_gb() only for CUDA devices so batch_size isn't
forced to 1 on non-CUDA devices.
🧹 Nitpick comments (1)
acestep/handler.py (1)

1583-1586: use_lm is currently unused in the guard.

Either drop it or use it to pick the appropriate tier limit so callers can control whether LM-specific caps apply.

♻️ Possible refinement
-                tier_max = gpu_config.max_batch_size_with_lm
+                tier_max = (
+                    gpu_config.max_batch_size_with_lm
+                    if use_lm
+                    else gpu_config.max_batch_size_without_lm
+                )

Also applies to: 1615-1622

Comment on lines +1605 to +1607
        device = self.device
        if device == "cpu" or device == "mps":
            return batch_size  # No CUDA VRAM to guard

⚠️ Potential issue | 🟡 Minor

Skip CUDA-only free-VRAM checks on non-CUDA devices to avoid forced batch=1.

The current guard uses string equality comparison on what is likely a torch.device object, which fails for non-CUDA devices. Devices like "xpu" bypass this check and proceed to call get_effective_free_vram_gb(), which may report 0 VRAM and collapse batch size. Convert the device to string before comparison or use .type attribute to reliably detect CUDA devices.

🛠️ Suggested fix
-        device = self.device
-        if device == "cpu" or device == "mps":
-            return batch_size  # No CUDA VRAM to guard
+        device = self.device
+        device_str = str(device)
+        is_cuda = device_str == "cuda" or device_str.startswith("cuda")
+        if not is_cuda:
+            return batch_size  # No CUDA VRAM to guard

Also applies to: 1631-1633

🤖 Prompt for AI Agents
In `@acestep/handler.py` around lines 1605 - 1607, The device check currently
compares the device variable to strings and can miss non-CUDA devices (e.g.,
torch.device objects like "xpu"), causing CUDA-only VRAM logic to run
incorrectly; update the check in the handler where device is read (the block
using device = self.device and the similar block at the later occurrence) to
reliably detect CUDA by converting to string or using device.type (e.g., if
str(device) in ("cpu","mps") or if getattr(device,"type",str(device)) in
("cpu","mps")), and keep the subsequent call to get_effective_free_vram_gb()
only for CUDA devices so batch_size isn't forced to 1 on non-CUDA devices.

chuxij and others added 6 commits February 10, 2026 13:30
When batch_size > 1, VAE decode VRAM scales linearly with batch size.
On 8GB GPUs (tier3, batch=2), decoding 2 samples at once exceeds VRAM.

Fix: In _tiled_decode_inner, when B > 1, decode each sample individually
and move results to CPU immediately after each decode. This keeps peak
VRAM constant regardless of batch size.

Also updated tier3 max_batch_size_with_lm from 1 to 2 (8GB GPUs with
0.6B LM + offload have sufficient headroom for batch=2).
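A sketch of the per-sample decode strategy this commit describes. The vae.decode call and tensor handling here are placeholders; the real change lives in _tiled_decode_inner and may differ in detail.

```python
import torch

def decode_batch_sequentially(vae, latents):
    """Decode one sample at a time and park results on CPU so peak VRAM
    stays roughly constant regardless of batch size."""
    outputs = []
    for i in range(latents.shape[0]):              # iterate over the batch dimension
        with torch.no_grad():
            audio = vae.decode(latents[i:i + 1])   # keep a batch dim of 1
        outputs.append(audio.cpu())                # move off the GPU right away
        del audio
        torch.cuda.empty_cache()
    return torch.cat(outputs, dim=0)
```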
Now that VAE decode is batch-sequential (no extra VRAM per sample),
the bottleneck is only DiT activations which scale modestly.

Updated batch limits:
  - tier5  (12-16GB): with_lm 2→4, without_lm stays 4
  - tier6b (20-24GB): with_lm 4→8, without_lm stays 8

Summary of all tiers (LM / No LM):
  tier1  ≤4GB:   1/1    tier4  8-12GB:  2/4
  tier2  4-6GB:  1/1    tier5  12-16GB: 4/4
  tier3  6-8GB:  2/2    tier6a 16-20GB: 4/8
  tier6b 20-24GB: 8/8   unlimited ≥24GB: 8/8

Updated docs (en/zh/ja).
- test_time_scaling.py: Add _load_scoring_model_context() that moves the
  HF scoring model to GPU only during forward pass and offloads back to
  CPU afterwards (for vllm/mlx backends). Move output logits to CPU to
  avoid keeping large vocab tensors on GPU.

- llm_inference.py: When offload_to_cpu=True, keep the HF scoring model
  on CPU after initial loading (vllm/mlx backends). The context manager
  in test_time_scaling.py handles GPU placement on demand.

- dit_alignment_score.py: Force MusicLyricScorer.calculate_score() to
  always compute on CPU. The scoring matrices are small and do not
  benefit from GPU acceleration, while occupying VRAM that DiT/VAE/LM
  need on low-VRAM GPUs.
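A minimal sketch of the offload context manager described for test_time_scaling.py above. The name scoring_model_on_gpu and the usage lines are illustrative assumptions, not the project's actual API.

```python
from contextlib import contextmanager
import torch

@contextmanager
def scoring_model_on_gpu(model, device="cuda"):
    """Keep the HF scoring model on CPU and borrow the GPU only for a forward pass."""
    model.to(device)
    try:
        yield model
    finally:
        model.to("cpu")           # offload weights back to CPU
        torch.cuda.empty_cache()  # release cached allocator blocks

# Usage sketch: keep the large vocab logits off the GPU as well.
# with scoring_model_on_gpu(scorer) as m:
#     logits = m(**inputs).logits.cpu()
```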
Keep _vram_guard_reduce_batch (our feature). Remove
_start_diffusion_progress_estimator (now provided by ProgressMixin
from acestep/core/generation/handler/progress.py).